Survey on Comparable Corpora until June 2012

نویسنده

  • Victor Chakraborty
چکیده

Here we present a survey of important work done on Comparable Corpora between the period 1995 to 2012. Unlike parallel corpora, which are clearly defined as translated texts, there is a wide variation of non-parallelism in comparable text. Non-parallelism is manifested in terms of differences in author, domain, topics, time period, language. The most common text corpora have non-parallelism in all these dimensions. The higher the degree of non-parallelism, the more challenging is the extraction of bilingual information. Such a corpus is nevertheless a desirable source of bilingual information, especially for new words. In this report we have first classified the research on comparable corpora into various categories. This is followed by detailed literature survey on comparable corpora and comparability metrics. After that we discuss the work related to the enhancement of comparability metrics in corpus. We conclude with the brief summary of this survey on comparable corpora. 1 Classification of Work on Comparable Corpora We can broadly classify the research on comparable corpora into the following sections. • Correlation based extraction • Vector representation • Classifiers based extraction • Linguistic knowledge based extraction Each of these classes are described in the following sections along with the gist of pioneering work in these domain. 2 Correlation based Extraction Most of the work on comparable corpora is based on correlation between word co-occurrence. They consider the context of a word as a feature to map the source word to the target word. Moreover most of the work based on this idea is focused towards extraction of bilingual lexicons only rather than parallel sentences.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

استخراج پیکره‌ موازی از اسناد قابل‌مقایسه برای بهبود کیفیت ترجمه در سیستم‌های ترجمه ماشینی

Data used for training statistical machine translation method are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation but due to lack of these kinds of texts, non-parallel and comparable corpora are used either for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

متن کامل

Measuring Comparability of Multilingual Corpora Extracted from Wikipedia ∗ Midiendo la comparabilidad de copus multilingües extráıdos de la Wikipedia

Comparable corpora can be used for many linguistic tasks such as bilingual lexicon extraction. By improving the quality of comparable corpora, we improve the quality of the extraction. This article describes some strategies to build comparable corpora from Wikipedia and proposes a measure of comparability. Experiments were performed on Portuguese, Spanish, and English Wikipedia.

متن کامل

Ninth Workshop on Building and Using Comparable Corpora Workshop Programme

Comparable corpora are the most versatile and valuable resource for multilingual Natural Language Processing. The speaker will argue that comparable corpora can support a wider range of applications than has been demonstrated so far in the state of the art. The talk will present completed and ongoing work conducted by the speaker and colleagues from his research group where comparable corpora a...

متن کامل

Revising the Compositional Method for Terminology Acquisition from Comparable Corpora

In this paper, we present a new method that improves the alignment of equivalent terms monolingually acquired from bilingual comparable corpora: the Compositional Method with Context-Based Projection (CMCBP). Our overall objective is to identify and to translate high specialized terminology made up of multi-word terms acquired from comparable corpora. Our evaluation in the medical domain and fo...

متن کامل

Wikipedia as an SMT Training Corpus

This article reports on mass experiments supporting the idea that data extracted from strongly comparable corpora may successfully be used to build statistical machine translation systems of reasonable translation quality for in-domain new texts. The experiments were performed for three language pairs: SpanishEnglish, German-English and RomanianEnglish, based on large bilingual corpora of simil...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012